34. Case Studies of Multicore Architectures II

多核架构案例研究 II
AP SHANTHI博士

本模块的目标是讨论两种不同风格的多核架构的显着特征，即。Sun 的 Niagara 和 IBM 的 Cell Broadband Engine。

前面的模块讨论了对多核处理器的需求，并以案例研究的形式讨论了英特尔多核架构的显着特性。本模块将讨论另外两个多核架构作为案例研究。我们将首先讨论 Sun 的多核架构。

计算机设计人员最初专注于通过使用我们之前讨论过的不同技术增加频率和利用指令级并行性来提高桌面处理环境中单线程的性能。然而，由于主内存延迟和应用程序固有的低 ILP 方面的限制，对单线程性能的重视已经显示出收益递减。这导致微处理器设计复杂性激增，并使功耗成为主要问题。由于这些原因，Sun Microsystems 的处理器采用完全不同的微处理器设计方法。Sun 没有关注单线程或双线程的性能，而是针对 Web 服务和数据库等商业服务器环境中的多线程性能优化了体系结构。使用多个执行线程，可以找到每个循环执行的内容。因此，可以显着提高吞吐量，并且处理器利用率要高得多。

Sun Microsystems 在 UltraSPARC T 系列中基本上推出了两种处理器 - 2005 年推出的名为 Niagara 的 UltraSPARC T1 使用 90 纳米技术及其后继产品，代号为 Niagara2 的 UltraSPARC T2，2007 年推出的使用 65 纳米技术。我们将详细讨论这些架构的显着特征。

Niagara 通过结合芯片多处理器和细粒度多线程的思想来支持 32 个硬件线程。许多线程的并行执行有效地隐藏了内存延迟。然而，拥有 32 个线程对内存系统提出了很高的要求，以支持高带宽。为了提供这种带宽，交叉互连方案将内存引用路由到所有线程共享的存储芯片级 2 缓存。四个独立的片上内存控制器提供超过 20

GB/s 的内存带宽。图 38.1 提供了 Niagara 处理器的显着特征。

Niagara 支持 32 个硬件执行线程。该架构将四个线程组织成一个线程组。该组共享一个处理管道，称为 SPARC 管道。Niagara 使用 8 个这样的线程组，从而在 CPU 上产生 32 个线程。每个 SPARC 管道都包含用于指令和数据的 1 级缓存。硬件通过将组中的其他线程调度到 SPARC 管道上来隐藏给定线程上的内存和管道停顿，并使用零周期切换惩罚。32 个线程共享一个 3 MB 的二级缓存。该缓存是 4 路存储和流水线化的。它是 12-way set-associative，以最大限度地减少来自许多线程的冲突未命中。crossbar 互连提供 SPARC 管道、L2 缓存组和 CPU 上的其他共享资源之间的通信链接。它提供超过 200 GB/s 的带宽。每个源-目标对都可以使用一个二入口队列，并且它可以在交叉开关中每路最多排队 96 个事务。交叉开关还提供了与 I/O 子系统通信的端口。横杆也是机器的内存排序点。内存接口为四通道双数据速率 2 (DDR2) DRAM，支持最大带宽超过 20 GB/s，容量高达 128 GB。图 38.2 显示了 Niagara 处理器的框图。支持最大带宽超过20Gbytes/s，容量高达128Gbytes。图 38.2 显示了 Niagara 处理器的框图。支持最大带宽超过20Gbytes/s，容量高达128Gbytes。图 38.2 显示了 Niagara 处理器的框图。

SPARC 管道：SPARC 管道是一个简单而直接的管道，由六个阶段组成。图 38.3 显示了 SPARC 管道。每个 SPARC 管道支持四个线程。每个线程都有一组唯一的寄存器以及指令和存储缓冲区。线程组共享 L1 缓存、转换后备缓冲区 (TLB)、执行单元和大多数流水线寄存器。管道由六个阶段组成——获取、线程选择、解码、执行、内存和写回。在取指阶段，访问指令缓存和指令 TLB (ITLB)。以下阶段通过选择方式完成缓存访问。关键路径由 64 个条目、完全关联的 ITLB 访问设置。线程选择多路复用器确定四个线程程序计数器 (PC) 中的哪一个应该执行访问。流水线每个周期取两条指令。高速缓存中的预解码位指示长延迟指令。

在线程选择阶段，线程选择多路复用器从可用池中选择一个线程向下游阶段发出指令。此阶段还维护指令缓冲区。如果下游阶段不可用，则可以将在提取阶段从缓存中提取的指令插入到该线程的指令缓冲区中。线程选择逻辑决定哪个线程在获取和线程选择阶段的给定周期中处于活动状态。如图 38.3 所示，线程选择信号对于获取和线程选择阶段是通用的。因此，如果thread-select阶段选择一个线程向decode阶段发出指令，那么fetch阶段也会选择相同的指令访问缓存。线程选择逻辑使用来自各个流水线阶段的信息来决定何时选择或取消选择线程。例如，线程选择阶段可以使用指令缓存中的预解码位来确定指令类型，而一些陷阱只能在后面的流水线阶段检测到。因此，指令类型会导致在线程选择阶段取消选择线程，而在内存阶段检测到的后期陷阱需要从线程中刷新所有较新的指令并在陷阱处理期间取消选择自身。前两个阶段的管道寄存器为每个线程复制。而在内存阶段检测到的后期陷阱需要从线程中刷新所有较新的指令并在陷阱处理期间取消选择自身。前两个阶段的管道寄存器为每个线程复制。而在内存阶段检测到的后期陷阱需要从线程中刷新所有较新的指令并在陷阱处理期间取消选择自身。前两个阶段的管道寄存器为每个线程复制。

对于 ADD 等单周期指令，Niagara 实现完全绕过来自同一线程的新指令以解决 RAW 依赖性。在加载结果对下一条指令可见之前，加载指令具有三个周期的延迟。这种长延迟指令可能会导致管道危险，解决这些问题需要暂停相应的线程，直到危险消除。因此，在加载的情况下，来自同一线程的下一条指令将等待两个周期以清除危险。

在多线程管道中，争夺共享资源的线程也会遇到结构性危害。具有每周期一条指令的吞吐量的资源（例如 ALU）不会造成危害，但每周期吞吐量少于一条指令的除法器存在调度问题。在这种情况下，任何必须执行 DIV 指令的线程都必须等到分隔符空闲。线程调度器通过为最近最少执行的线程提供优先级来保证访问分频器的公平性。尽管正在使用分配器，但其他线程可以使用空闲资源，例如 ALU、加载存储单元等。

线程选择策略是每个周期在可用线程之间切换，优先考虑最近最少使用的线程。由于加载、分支、乘法和除法等长延迟指令，线程可能变得不可用。由于缓存未命中、陷阱和资源冲突等管道停顿，它们也变得不可用。线程调度程序假定加载是缓存命中，因此可以推测性地从同一线程发出相关指令。然而，与可以发出非推测性指令的线程相比，这种推测性线程被分配了较低的指令发布优先级。

来自选定线程的指令进入解码阶段，执行指令解码和寄存器文件访问。支持的执行单元包括算术逻辑单元 (ALU)、移位器、乘法器和除法器。旁路单元处理必须在寄存器文件更新之前传递给相关指令的指令结果。ALU 和移位指令具有单周期延迟并在执行阶段生成结果。乘法和除法运算的延迟很长，会导致线程切换。

The load store unit contains the data TLB (DTLB), data cache, and store buffers. The DTLB and data cache access take place in the memory stage. Like the fetch stage, the critical path is set by the 64-entry, fully associative DTLB access. The load-store unit contains four 8-entry store buffers, one per thread. Checking the physical tags in the store buffer can indicate read after write (RAW) hazards between loads and stores. The store buffer supports the bypassing of data to a load to resolve RAW hazards. The store buffer tag check happens after the TLB access in the early part of write back stage.

Load data is available for bypass to dependent instructions late in the write back stage. Single-cycle instructions such as ADD will update the register file in the write back stage.

The L1 instruction cache is 16 Kbyte, 4-way set-associative with a block size of 32 bytes. A random replacement scheme is implemented for area savings, without incurring significant performance cost. The instruction cache fetches two instructions each cycle. If the second instruction is useful, the instruction cache has a free slot, which the pipeline can use to handle a line fill without stalling. The L1 data cache is 8 Kbytes, 4-way set-associative with a line size of 16 bytes, and implements a write-through policy. Even though the L1 caches are small, they significantly reduce the average memory access time per thread with miss rates in the range of 10 percent. Because commercial server applications tend to have large working sets, the L1 caches must be much larger to achieve significantly lower miss rates, so this sort of tradeoff is not favorable for area. However, the four threads in a thread group effectively hide the latencies from L1 and L2 misses. Therefore, the cache sizes are a good trade -off between miss rates, area, and the ability of other threads in the group to hide latency.

Niagara uses a simple cache coherence protocol. The L1 caches are write through, with allocate on load and no -allocate on stores. L1 lines are either in valid or invalid states. The L2 cache maintains a directory that shadows the L1 tags. The L2 cache also interleaves data across banks at a 64-byte granularity. A load that missed in an L1 cache (load miss) is delivered to the source bank of the L2 cache along with its replacement way from the L1 cache. There, the load miss address is entered in the corresponding L1 tag location of the directory, the L2 cache is accessed to get the missing line and data is then returned to the L1 cache. The directory thus maintains a sharers list at L1-line granularity. A subsequent store from a different or same L1 cache will look up the directory and queue up invalidates to the L1 caches that have the line. Stores do not update the local caches until they have updated the L2 cache. During this time, the store can pass data to the same thread but not to other threads; therefore, a store attains global visibility in the L2 cache. The crossbar establishes memory order between transactions from the same and different L2 banks, and guarantees delivery of transactions to L1 caches in the same order. The L2 cache follows a copy-back policy, writing back dirty evicts and dropping clean evicts. Direct memory accesses from I/O devices are ordered through the L2 cache. Four channels of DDR2 DRAM provide in excess of 20 Gbytes/s of memory bandwidth.

每个 SPARC 内核具有以下单元：

• 指令提取单元(IFU) 包括以下流水线阶段——提取、线程选择和解码。IFU 还包括一个指令缓存复合体。

• 执行单元(EXU) 包括管道的执行阶段。

   • 加载/存储单元(LSU) 包括内存和回写阶段，以及数据缓存复合体。

• 陷阱逻辑单元(TLU) 包括陷阱逻辑和陷阱程序计数器。

• 流处理单元(SPU) 用于加密的模块化算术函数。

• 内存管理单元(MMU)。

• 浮点前端单元(FFU) 与FPU 接口。

由此，我们从上面的细节中可以看出，Niagara 处理器实现了富线程架构，旨在为商业服务器应用提供高性能的解决方案。硬件支持 32 个线程，内存子系统由板载交叉开关、二级缓存和内存控制器组成，用于高度集成的设计，利用服务器应用程序固有的线程级并行性，同时以低功耗为目标。

接下来，我们将讨论 IBM 的 Cell Broad Band Engine (Cell BE) 架构。

IBM 的 Cell BE： Cell BE 是由 IBM、SCEI/Sony 和 Toshiba 开发的异构架构，其设计目标如下：

· Power 的加速器扩展 o 建立在 Power 生态系统之上

o 使用最知名的系统实践进行处理器设计

· 树立新的绩效标准

o 在实现高频的同时利用并行性

o 具有极端浮点能力的超级计算机属性

o 维持高内存带宽

· 自然人机交互设计

o 逼真的效果

o 可预测的实时响应

o 用于并发活动的虚拟化资源

· 灵活设计

o 广泛的应用领域

o 高度抽象为高度可利用的编程模型 o 可重新配置的 I/O 接口

o 用于安全的虚拟可信计算环境

The Cell BE is a heterogeneous architecture because it has different types of cores. The Cell BE logically defines four separate types of functional components: the PowerPC Processor Element – one core (PPE), synergistic processor units – eight cores (SPU), the memory flow controller (MFC), and the internal interrupt controller (IIC). The computational units in the CBEA-compliant processor are the PPEs and the SPUs. Each SPU has a dedicated local storage, a dedicated MFC with its associated memory management unit (MMU), and a replacement management table (RMT). The combination of these components is called a Synergistic Processor Element (SPE). Figure 38.4 shows the architecture of Cell BE. The element interconnect bus (EIB) connects the various units within the processor.

Power Processing Element: The PPE is a 64 bit, “Power Architecture” processor, with a 32KB primary instruction cache, a 32KB primary data cache and a 512KB secondary cache. It is a dual issue processor that is dual threaded with in-order static issue. It acts as the control unit for the SPEs. It runs the OS and most of the applications but compute intensive parts of the OS and applications will be offloaded to the SPEs. It is a RISC architecture that uses considerably less power than other PowerPC devices, even at higher clock rates. The major functional units of the PPE are given in Figure 38.5. It is composed of three main units – the Instruction Unit (IU), which takes care of the fetch, decode, branch, issue and completion, the Fixed-Point Execution Unit (XU), which handles the fixed – point instructions and load/store instructions and the Vector Scalar Unit (VSU), which handles the vector and floating point instructions. The PPE pipeline is illustrated in Figure 38.6.

协同处理元件 (SPE)： SPE 有八个。它们遵循 SIMD 风格的指令集架构，并针对计算密集型应用程序的功率和性能进行了优化。SPE 的各种功能单元如图 38.7 所示，SPE 管道如图 38.8 所示。一个 SPE 可以在一个时钟周期内对 16 个 8 位整数、8 个 16 位整数、4 个 32 位整数或 4 个单精度浮点数进行运算，以及一个内存操作。

SPE 具有用于指令和数据的本地存储存储器。这个 256KB 的本地存储是 SPE 的最大组件，并充当内存层次结构的附加级别 。它是一个单端口 SRAM，能够通过窄 128 位和宽 128 字节端口进行读写。SPE 对从本地存储读取或写入本地存储的寄存器进行操作。本地存储可以以最小 1KB（最大 16KB）的块访问主内存，但 SPE 不能直接作用于主内存（它们只能将数据移入或移出本地存储）。缓存可以提供类似甚至更快的数据速率，但只能在非常短的突发（最多几百个周期）中提供，每个本地存储都可以以该速率连续提供数据超过一万个周期，而无需进入 RAM。

Data and instructions are transferred between the local store and the system memory by asynchronous DMA commands, controlled by the Memory Flow Controller, which provides support for 16 simultaneous DMA commands. The SPE instructions insert DMA commands into queues which are then pooled as DMA commands into a single “DMA list” command. They support a 128 entry unified register file for improved memory bandwidth, and optimum power efficiency and performance .

The Element Interconnect Bus (EIB) provides the data ring for internal communication. You can form four 16 byte data rings of low latency, multiple simultaneous transfers at 96B/cycle peak bandwidth (at ½ CPU speed). The EIB is shown in Figure 38.9.

Cells 与普通 CPU 的一个很大区别是 Cell 中的 SPE 能够链接在一起以充当流处理器。在一个典型的使用场景中，系统会用小程序加载 SPE，将 SPE 链接在一起以处理复杂操作中的每一步。例如，数字电视的解码是一个复杂的过程，可以分解成一个流，如图 38.10 所示。数据将从 SPE 传递到 SPE，直到完成最终处理。另一种可能性是对输入数据集进行分区，并让多个 SPE 并行执行相同类型的操作。在 3.2 GHz 时，每个 SPE 的单精度性能理论上为 25.6 GFLOPS。

为了在功率受限的环境中显着提高应用程序性能，Cell BE 设计利用了所有级别的并行性。

数据级并行——具有 SPE 和 SIMD 指令支持
ILP – 使用静态调度和功耗感知微架构
计算传输并行性——使用可编程数据传输引擎
TLP – 在 PPE 上采用多核设计方法和硬件多线程
内存级并行——通过从每个内核和多个内核的多个请求重叠传输
 
因此，Cell BE 是基于对密码学、图形变换和照明、物理学、快速傅立叶变换 (FFT)、矩阵运算和科学工作负载等领域的广泛工作负载的分析而设计的，专为图形设计——和网络密集型工作，从视频游戏到医疗、国防、汽车和航空航天行业的复杂成像。

总而言之，我们在本模块中处理了两种不同风格的多核架构：

超SPARC
- 同质

– 多线程

– 服务器应用程序

– 节能

IBM 的 Cell BE
 
– 异构

– 专为游戏应用程序打造

– 利用不同类型的并行性

网页链接/支持材料

Computer Architecture – A Quantitative Approach，John L. Hennessy 和 David A. Patterson，第 5 版，Morgan Kaufmann，Elsevier，2011。
 
IBM、索尼、东芝在 ISSCC'05 上的论文
 
“用于 CELL 处理器的流处理单元”，B. Flachs 等。阿尔。
 
“第一代 CELL 处理器的设计和实现”，D. Pham 等。阿尔。
 
“微处理器报告”，
 
Reed Electronics Group，2005 年，1 月 31 日和 2 月 14 日
 
“IBM 的单元处理器：下一代计算？”，DK Every，共享软件出版社，2005 年 2 月
 
“高能效处理器架构和单元处理器”，HP Hofstee，HPCA-11 2005
 
“高能效处理器设计和单元处理器”，IBM，2005
 
“IBM/Sony/Toshiba Cell Processor 简介”，JH Stokes，http://arstechnica.com/
 
“细胞架构解释”，N. Blachford，http://www.blachford.info/
 
Kunle Olukotun、Poonacha Kongetira、Kathirgamar Aingaran，“Niagara：32 路多线程 Sparc 处理器”，IEEE Micro，vol. 25，没有。, pp. 21-29, March/April 2005, doi:10.1109/MM.2005.35
 
http://opensparc.sunsource.net/nonav/opensparct1.html
 
http://www.sun.com/processors/UltraSPARC-T1/features.xml
 
http://www.sun.com/servers/coolthreads/t1000/benchmarks.jsp
 
http://news.com.com/Sun+begins+Sparc+phase+of+server+overhaul/2163-1010_3-5983365.html